Unsupervised Decomposition of a Document into Authorial Components
Abstract
We propose a novel unsupervised method for separating out distinct authorial components of a document. In particular, we show that, given a book artificially “munged” from two thematically similar biblical books, we can separate out the two constituent books almost perfectly. This allows us to automatically recapitulate many conclusions reached by Bible scholars over centuries of research. One of the key elements of our method is exploitation of differences in synonym choice by different authors.
Similar resources
Unsupervised Multi-Author Document Decomposition Based on Hidden Markov Model
This paper proposes an unsupervised approach for segmenting a multiauthor document into authorial components. The key novelty is that we utilize the sequential patterns hidden among document elements when determining their authorships. For this purpose, we adopt a Hidden Markov Model (HMM) and construct a sequential probabilistic model to capture the dependencies of sequential sentences and their...
Unsupervised Decomposition of a Multi-Author Document Based on Naive-Bayesian Model
This paper proposes a new unsupervised method for decomposing a multi-author document into authorial components. We assume that we do not know anything about the document and the authors, except the number of the authors of that document. The key idea is to exploit the difference in the posterior probability of the Naive-Bayesian model to increase the precision of the clustering assignment and ...
Unsupervised Authorial Clustering Based on Syntactic Structure
This paper proposes a new unsupervised technique for clustering a collection of documents written by distinct individuals into authorial components. We highlight the importance of utilizing syntactic structure to cluster documents by author, and demonstrate experimental results that show the method we outline performs on par with state-of-the-art techniques. Additionally, we argue that this fea...
Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering
Several authorship analysis tasks require the decomposition of a multiauthored text into its authorial components. In this regard two basic prerequisites need to be addressed: (1) style breach detection, i.e., the segmenting of a text into stylistically homogeneous parts, and (2) author clustering, i.e., the grouping of paragraph-length texts by authorship. In the current edition of PAN we focu...
Clustering by Authorship Within and Across Documents
The vast majority of previous studies in authorship attribution assume the existence of documents (or parts of documents) labeled by authorship to be used as training instances in either closed-set or open-set attribution. However, in several applications it is not easy or even possible to find such labeled data and it is necessary to build unsupervised attribution models that are able to estim...